Rvocado

Welcome to the R project of Daniel Teixeira and Christian Bester. In this project, we will slice and dice the avocado dataset to see if we can get some meaningful insights from the much beloved fruit of the millenials!

The ultimate purpose of this project is to attempt some hypothesis testing and (hopefully) confirmation, using the statistical programming software, R.

1    Setting up the Workspace

In this section, we will set up the R workspace before we proceed with our statistical analysis.

1.1    Clearing the enviroment

rm(list = ls())

The first step is to clear the R environment. The whole point of working with R is reproducibility, which means that you shouldn’t be working with previously saved data. So we begin by clearing the R environment.

1.2    Loading packages

pkgs <- c("Hmisc", "ggfortify", "kableExtra", "lubridate", "scales", "sjPlot", "tidyverse")
sapply(pkgs, function(x) if(!x %in% installed.packages()) install.packages(x, repos = "http://cran.us.r-project.org"))
sapply(pkgs, library, character.only = TRUE)

As this is a reproducible research that will be shared with other researchers, the code needs to check if the packages used here is installed on the users PC before loading the package, otherwise an error will be thrown and the document will not be created.

This test is done using an sapply, which essentially loops over the vector of packages and checks if it is not listed in the installed packages. If this is true, the package is installed.

We then load the packages required into R, first by saving the list of packages we need to a vector. and then using sapply to load the packages in one line, rather than multiple calls to library. This is very useful if one has a multitude of packages.

1.3    Global options

options("scipen"=999)

This code removes scientific notation for up to 999 digits.

2    Exploratory Data Analysis

Exploratory Data Analysis (EDA) is very important in data analytics. It is through this process that one is able to formulate various hypotheses that can then be formally tested.

This section is therefore the longest portion of this document, which will go through various stages from the initial data imports, data cleaning and manipulation to data visualisation and hypothesis formation.

2.1    Data import

We read in the data using the read.csv function from the utils package, which is installed with R and is loaded when R is launched. We set check.names to FALSE to prevent periods being placed where there is a space character. We save the data to pdata, where p stands for project. Hence, the name of the data is project data. Always try to have informative names for objects when programming.

We use the names function, also from utils, to see the column names of the dataset, so that we can know the variable names of the dataset.

pdata <- read.csv("avocado.csv", check.names = FALSE)
names(pdata)

The output of the preceding command is as follows:

 [1] ""             "Date"         "AveragePrice" "Total Volume" "4046"        
 [6] "4225"         "4770"         "Total Bags"   "Small Bags"   "Large Bags"  
[11] "XLarge Bags"  "type"         "year"         "region"      

These are the variables collected in the dataset. Immediately, we can see that there is an issue with the first column, the column name is blank. So we have no clue of what is in it.

The snapshot of the preceding dataset is shown here using the glimpse function:

glimpse(pdata)
Rows: 18,249
Columns: 14
$ ``             <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ Date           <chr> "2015-12-27", "2015-12-20", "2015-12-13", "2015-12-0...
$ AveragePrice   <dbl> 1.33, 1.35, 0.93, 1.08, 1.28, 1.26, 0.99, 0.98, 1.02...
$ `Total Volume` <dbl> 64236.62, 54876.98, 118220.22, 78992.15, 51039.60, 5...
$ `4046`         <dbl> 1036.74, 674.28, 794.70, 1132.00, 941.48, 1184.27, 1...
$ `4225`         <dbl> 54454.85, 44638.81, 109149.67, 71976.41, 43838.39, 4...
$ `4770`         <dbl> 48.16, 58.33, 130.50, 72.58, 75.78, 43.61, 93.26, 80...
$ `Total Bags`   <dbl> 8696.87, 9505.56, 8145.35, 5811.16, 6183.95, 6683.91...
$ `Small Bags`   <dbl> 8603.62, 9408.07, 8042.21, 5677.40, 5986.26, 6556.47...
$ `Large Bags`   <dbl> 93.25, 97.49, 103.14, 133.76, 197.69, 127.44, 122.05...
$ `XLarge Bags`  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
$ type           <chr> "conventional", "conventional", "conventional", "con...
$ year           <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015...
$ region         <chr> "Albany", "Albany", "Albany", "Albany", "Albany", "A...

One might be tempted into believing that the first column in simply a row or observation count beginning from zero. This might not be the case. To test this, we reverse the dataframe and then use glimpse on the reversed data. What this does is allow us to see the observations at the end of the data through the glimpse function, which otherwise only shows data of the first few observations.

pdata %>% map_df(rev) %>% glimpse()
Rows: 18,249
Columns: 14
$ ``             <int> 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 11, 10, 9, 8, ...
$ Date           <chr> "2018-01-07", "2018-01-14", "2018-01-21", "2018-01-2...
$ AveragePrice   <dbl> 1.62, 1.93, 1.87, 1.71, 1.63, 1.57, 1.56, 1.57, 1.54...
$ `Total Volume` <dbl> 17489.58, 16205.22, 13766.76, 13888.04, 17074.83, 15...
$ `4046`         <dbl> 2894.77, 1527.63, 1191.92, 1191.70, 2046.96, 1924.28...
$ `4225`         <dbl> 2356.13, 2981.04, 2452.79, 3431.50, 1529.20, 1368.32...
$ `4770`         <dbl> 224.53, 727.01, 727.94, 0.00, 0.00, 0.00, 0.00, 0.00...
$ `Total Bags`   <dbl> 12014.15, 10969.54, 9394.11, 9264.84, 13498.67, 1269...
$ `Small Bags`   <dbl> 11988.14, 10919.54, 9351.80, 8940.04, 13066.82, 1243...
$ `Large Bags`   <dbl> 26.01, 50.00, 42.31, 324.80, 431.85, 256.22, 223.18,...
$ `XLarge Bags`  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
$ type           <chr> "organic", "organic", "organic", "organic", "organic...
$ year           <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018...
$ region         <chr> "WestTexNewMexico", "WestTexNewMexico", "WestTexNewM...

So, the nameless column is not equal to the number of observations in the data. We will give it an arbitrary name so that it can be selected in operations.

colnames(pdata)[1] <- "code"

2.1    Descriptive Statistics

Descriptive statistics is a method of summarizing a dataset quantitatively. These summaries can be simple quantitative statements about the data or a visual representation sufficient enough to be part of the initial description about the dataset.

To get a set of summary statistics, one can use the base R function, summary. However, this function is better used for numerical data. The avocado dataset has numerical and categorical data. The Hmisc package has a function called describe, which provides extra level of summary statistics. In addition to what is given by summary, such as missing values, distribution of numerical variables, and distinct values of categorical variables.

d <- describe(pdata)

We don’t print the output to the console so that it is formatted as html, using the html function, also from Hmisc, because the output of this document is html file. We specifically instruct R to import the function from Hmisc to avoid potential namespace conflicts with the html function from the htmltools package.

Hmisc::html(describe(pdata), size=85, tabular=TRUE,
      greek=TRUE, scroll=TRUE, rows=25, cols=125,
      border = 2)
pdata

14 Variables   18249 Observations

code
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
18249053124.2317.85 2 41024384649
lowest : 0 1 2 3 4 , highest: 48 49 50 51 52
Date
nmissingdistinct
182490169
lowest :2015-01-042015-01-112015-01-182015-01-252015-02-01
highest:2018-02-252018-03-042018-03-112018-03-182018-03-25

AveragePrice
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
18249025911.4060.45120.830.931.101.371.661.932.11
lowest : 0.44 0.46 0.48 0.49 0.51 , highest: 3.04 3.05 3.12 3.17 3.25
Total Volume
image
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
    18249        0    18237        1   850644  1450905     2372     3897    10839 
      .50      .75      .90      .95 
   107377   432962  1387046  3716315 
 
lowest : 84.56 379.82 385.55 419.98 472.82
highest:46324529.7047293921.6052288697.8961034457.1062505646.52

4046
image
          n    missing   distinct       Info       Mean        Gmd        .05 
      18249          0      17702          1     293008     524489      19.60 
        .10        .25        .50        .75        .90        .95 
      94.28     854.07    8645.30  111020.20  538385.18 1263359.68 
 
lowest : 0.00 1.00 1.13 1.19 1.20
highest:17076650.8217787611.9318933038.0421620180.9022743616.17

4225
image
         n   missing  distinct      Info      Mean       Gmd       .05       .10 
     18249         0     18103         1    295155    509218     103.6     367.5 
       .25       .50       .75       .90       .95 
    3008.8   29061.0  150206.9  500784.6 1303657.7 
 
lowest : 0.00 1.26 1.28 1.30 1.31
highest:17896391.6018956479.7420328161.5520445501.0320470572.61

4770
image
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
    18249        0    12071    0.973    22840    41861        0        0        0 
      .50      .75      .90      .95 
      185     6243    31492   106157 
 
lowest : 0.00 0.83 1.00 1.01 1.09
highest:1811090.711880231.381896149.501993645.362546439.11

Total Bags
image
         n   missing  distinct      Info      Mean       Gmd       .05       .10 
     18249         0     18097         1    239639    405113     628.9    1299.2 
       .25       .50       .75       .90       .95 
    5088.6   39743.8  110783.4  442141.9 1005478.9 
 
lowest : 0.00 3.09 3.11 3.19 3.33
highest:15804696.3115972492.0716298296.2916394524.1119373134.37

Small Bags
image
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
    18249        0    17321        1   182195   309915    256.7    583.1   2849.4 
      .50      .75      .90      .95 
  26362.8  83337.7 354266.9 768147.2 
 
lowest : 0.00 2.52 2.57 2.73 2.79
highest:11392828.8911712807.1912540327.1912567155.5813384586.80

Large Bags
image
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
    18249        0    15082    0.998    54338    96628      0.0      0.0    127.5 
      .50      .75      .90      .95 
   2647.7  22029.2  94295.3 195699.8 
 
lowest : 0.00 0.97 1.30 1.33 1.38
highest:3988101.744023485.044081397.724324231.195719096.61

XLarge Bags
image
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
    18249        0     5588    0.712     3106     5871      0.0      0.0      0.0 
      .50      .75      .90      .95 
      0.0    132.5   3688.9  12058.5 
 
lowest : 0.00 1.00 1.11 1.26 1.30
highest:377661.06387400.22390478.73454343.65551693.65

type
nmissingdistinct
1824902
 Value      conventional      organic
 Frequency          9126         9123
 Proportion          0.5          0.5
 

year
image
nmissingdistinctInfoMeanGmd
18249040.91120161.031
 Value       2015  2016  2017  2018
 Frequency   5615  5616  5722  1296
 Proportion 0.308 0.308 0.314 0.071
 

region
image
nmissingdistinct
18249054
lowest :Albany Atlanta BaltimoreWashingtonBoise Boston
highest:Syracuse Tampa TotalUS West WestTexNewMexico

2.1.1    Standard deviation

There are a few more functions in R that are useful for summary analysis, which can be applied to the variables individually. To get to know the standard deviation of the dataset, we can use the sd function that can be applied to the numerical variables only. To do this, we use sumarise_if to the data, where only the variables that are numeric are selected for the summary operation

data_sd <- pdata %>%
  summarise_if(is.numeric, funs(sd))

Here is the output from the above operation

code AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags year
15.48104 0.4026766 3453545 1264989 1204120 107464.1 986242.4 746178.5 243966 17692.89 0.9399385

2.1.2    Box plot

We can represent the data presented by summary in a graphical format using a boxplot. We use the ggplot2 package to create the box plot. The box plot can be plotted for the numerical columns only; hence in the aes call, we map the type of avocado to x and the average price to y.

Inside the geom_boxplot call, we specify the color, shape and size of the outliers. The theme_classic function gives a similar theme to the base plot function.

ggplot(pdata, aes(x = type, y = AveragePrice, fill = type))+
  geom_boxplot(outlier.color = "#192812", outlier.shape = 16, outlier.size = 2)+
  scale_fill_manual(values = c("#2b501c", "#8cb575"))+
  theme_classic()+
  theme(legend.position="bottom")

This box plot shows the median, first quartile, and third quartile values for all the variables. The outliers can also be shown in the dataset. In the following boxplot, the outliers have been enabled:

We can see from the box plot above that organic avocados are more expensive on average than conventional avocados. Moreover, the outliers of the organic avocado are more extreme, showing that the price of the organic avocado can swing more violently than conventional avocados.

2.1.3    Density plot

A density plot shows the distribution of the underlying data. Here we use a density plot to show the distribution of the sales volume of the types of avocado. Note that we transform the data by applying a log scale to the x data, which is volume, to circumvent large outliers from distorting the plot. This will allow us to see which type of avocado does more sales volume.

pdata %>%
ggplot(aes(x=`Total Volume`, fill=type))+
  scale_fill_manual(values=c("#2b501c", "#8cb575"))+
  scale_x_log10()+
  geom_density(alpha=0.7)+
  theme_classic()+
  theme(legend.position="bottom")

As can be seen from the plot below, the conventional avocado has a higher sales volume than the organic avocado. Is this because of the price?

2.1.4    Scatterplot

Let’s look at how a few different variables relate to each other, e.g: Average price and volume. The law of demand and supply states that as supply increases, price reduces. Is this the case with Avocado? Let us find out.

pdata %>% 
  ggplot(aes(`Total Volume`, AveragePrice, col=type)) + 
  geom_point() + 
  geom_smooth(method="lm")+
  scale_x_log10()+
  scale_color_manual(values = c("#2b501c", "#8cb575"))+
  theme_classic()+
  theme(legend.position="bottom")

The output from the code above is shown below:

As you can see, an increase in the volume of avocado tends to reduce the price.

2.1.5    Bar chart

We will use a bar chart to show the monthly price swings of avocados. We will filter the regions to the total US region. The code below creates a new column which shows the percentage change in the price.

growth.mon <- pdata %>%
  filter(region == "TotalUS") %>%
  select(Date, AveragePrice, type) %>%
  arrange(type, Date) %>%
  group_by(type) %>%
  mutate(Rate = (AveragePrice - lag(AveragePrice)) / lag(AveragePrice), 
         Date = ymd(Date)) %>%
  select(-AveragePrice)

growth.mon[is.na(growth.mon)] <- 0

The code below uses multiple calls to geom_rect with different fills to create a striped background.

ggplot(growth.mon, aes(x = Date, y = Rate))+
  geom_rect(ymin = 0.8, ymax = Inf,
            xmin = 0, xmax = 1000000, fill = '#f5f6f9')+
  geom_rect(ymin = 0.6, ymax = 0.8, 
            xmin = 0, xmax = 1000000, fill = '#fbfcfc') +
  geom_rect(ymin = 0.4, ymax = 0.6,
            xmin = 0, xmax = 1000000, fill = '#f5f6f9')+
  geom_rect(ymin = 0.2, ymax = 0.4, 
            xmin = 0, xmax = 1000000, fill = '#fbfcfc')+ 
  geom_rect(ymin = 0, ymax = 0.2,
            xmin = 0, xmax = 1000000, fill = '#f5f6f9')+
  geom_rect(ymin = -0.2, ymax = 0,
            xmin = 0, xmax = 1000000, fill = '#fbfcfc')+
  geom_rect(ymin = -0.4, ymax = -0.2, 
            xmin = 0, xmax = 1000000, fill = '#f5f6f9')+
  geom_rect(ymin = -0.6, ymax = -0.4, 
            xmin = 0, xmax = 1000000, fill = '#fbfcfc')+
  geom_bar(stat = "identity", aes(fill = type), 
           show.legend = TRUE, position = "dodge")+
  # scale_fill_continuous(low = "#0095db")+
  scale_fill_manual(name = "Change in average price", values = c("#2b501c", "#43362e"))+
  scale_y_continuous(NULL, labels = percent_format(), limits = c(min(growth.mon$Rate), max(growth.mon$Rate)))+
  scale_x_date(date_breaks = "1 month", date_labels =  "%b %Y")+
  theme(axis.text.x=element_text(angle=60, hjust=1))+
  theme(legend.position = "top",
        legend.background = element_rect(color = NA),
        legend.key = element_rect(fill = NA, color = NA))

The output of the code above is shown below:

The graph shows that from the tail end of 2017, price changes of the conventional type of avocado have been more severe than price changes of the organic type. What could have caused this?

2.1.6    Price Chart

Finally, we look at the progression of the average avocado price.

pdata %>%
  mutate(Date = ymd(Date))%>%
  ggplot(aes(x=Date, y=AveragePrice, color=type))+
  geom_line()+
  theme_classic()+
  scale_color_manual(values = c("#2b501c", "#ed6325"))+
  scale_x_date(date_breaks = "1 month", date_labels =  "%b %Y",expand=c(0,0))+
  theme(axis.text.x=element_text(angle=60, hjust=1))+
  theme(legend.position = "top")

Output:

Looks like the average price of the organic avocado has a wider range than the conventional avocado.

3    Linear Models

A linear model is a way of mathematically representing a relationship between two variables. This relationship can be shown to be statistically significant, in which case the relationship is likely to be real or statistically insignificant, in which case the relationship is virtually non-existent.

Performing a linear regression in R is remarkably simple, since R is equipped with the easy to use lm function. In this section, we will be fitting linear models to the relationships we visualised in the previous section.

3.1    Linear regression

The most interesting observation we made in the previous section was that the price of avocados goes down as volume sold goes up. We will now test if this relationship is statistically significant.

conmod <- pdata %>%
  filter(type=="conventional") %>%
  mutate(Volume = log(`Total Volume`)) %>%
  lm(AveragePrice ~ Volume, data = .)

orgmod <- pdata %>%
  filter(type=="organic") %>%
  mutate(Volume = log(`Total Volume`)) %>%
  lm(AveragePrice ~ Volume, data = .)

We save the model in order to display the results in a table, rather than printing to the console. To this, we use the tab_model function from the sjPlot package.


tab_model(conmod, orgmod)

The code chunk outputs the table below:


tab_model(conmod, orgmod)
  AveragePrice AveragePrice
Predictors Estimates CI p Estimates CI p
(Intercept) 1.75 1.70 – 1.81 <0.001 2.07 2.02 – 2.12 <0.001
Volume -0.05 -0.05 – -0.04 <0.001 -0.04 -0.05 – -0.04 <0.001
Observations 9126 9123
R2 / R2 adjusted 0.053 / 0.053 0.030 / 0.030

The results show that indeed, the average price of avocado decreases as volume increases. The p-value is shows that the negative coefficient is statistically significant.

3.2    Regression diagnostics

We examine if the fitted model is any good. A quick way to do this is with autoplot and ggfortify installed. This will give a ggplot2 version of the regression diagnostics plot from base R.

autoplot(conmod)
autoplot(orgmod)


The output is shown below

autoplot(conmod)


autoplot(orgmod)

We can see that the model roughly follows a normal distribution as the deviations are not too steep. Furthermore, the standard deviation of the residuals does not exceed 3 and are mostly below 2.

We can therefore conclude that the price of avocado drops as volume increases.